[webgpu] Enable indirect dispatch for flash attention by qjia7 · Pull Request #26207 · microsoft/onnxruntime

qjia7 · 2025-09-30T08:19:57Z

This pull request introduces support for indirect dispatch in the WebGPU FlashAttention implementation, enabling more dynamic and efficient kernel launches based on runtime sequence lengths. The changes add new logic and parameters to propagate sequence length information and indirect dispatch buffers through the attention pipeline, with conditional code paths to maintain compatibility with the existing direct dispatch approach.

It is part of the work to enable graph capture for phi4 (#25868).
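To illustrate what an indirect dispatch consumes, here is a minimal sketch, not code from this PR. WebGPU's `dispatchWorkgroupsIndirect` reads three consecutive `u32` workgroup counts (x, y, z) from a buffer instead of taking them as host-side arguments. The helper name `ComputeIndirectDispatchArgs`, the `tile_size` parameter, and the choice to map heads onto the y dimension are all assumptions made for illustration; the actual FlashAttention kernels may partition the work differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical type (not from the PR): the three u32 workgroup counts
// (x, y, z) that an indirect dispatch reads from its buffer.
using IndirectDispatchArgs = std::array<uint32_t, 3>;

// Hypothetical helper: derives workgroup counts from a runtime sequence
// length, assuming each workgroup handles `tile_size` tokens for one head.
inline IndirectDispatchArgs ComputeIndirectDispatchArgs(uint32_t total_sequence_length,
                                                        uint32_t tile_size,
                                                        uint32_t num_heads) {
  assert(tile_size > 0);
  // Ceiling division: a partially filled last tile still needs a workgroup.
  const uint32_t num_tiles = (total_sequence_length + tile_size - 1) / tile_size;
  return {num_tiles, num_heads, 1u};
}
```

In an indirect-dispatch path these three values would live in a GPU buffer rather than being passed from the host at launch time, which is what lets the recorded commands stay the same while the sequence length changes. The host-side function above only shows the arithmetic and the buffer layout.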

onnxruntime/contrib_ops/webgpu/bert/flash_attention.h

qjia7

Thanks for your comments, @sushraja-msft. My replies are inline. As we discussed offline, I am going to merge this PR to unblock my follow-up work. @sushraja-msft @fs-eire, please continue your review if you have more comments; I will follow up on them separately.

qjia7 added 2 commits September 30, 2025 15:39

Enable indirect dispatch for flash attention

cb37e18

TODOs and code clean up

0a94dcb

qjia7 marked this pull request as ready for review September 30, 2025 12:27

qjia7 requested review from fs-eire, guschmue and sushraja-msft September 30, 2025 12:27

guschmue added the ep:WebGPU ort-web webgpu provider label Oct 1, 2025

guschmue approved these changes Oct 13, 2025

sushraja-msft reviewed Oct 14, 2025

onnxruntime/contrib_ops/webgpu/bert/flash_attention.h

sushraja-msft reviewed Oct 14, 2025

onnxruntime/contrib_ops/webgpu/bert/flash_attention.h

qjia7 commented Oct 14, 2025

onnxruntime/contrib_ops/webgpu/bert/flash_attention.h

qjia7 merged commit cd4ac49 into main Oct 14, 2025
92 checks passed

qjia7 deleted the fa_indirect_dispatch branch October 14, 2025 07:02
